Sentiment Analysis

e.g., Is this IMDB movie review a positive one? There are keywords in a review that can be used to determine the attitude toward the movie, as in the sketch below.
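
A minimal sketch of the keyword idea (the word lists here are hypothetical, not a real sentiment lexicon):

```python
# Minimal keyword-counting sentiment sketch; the word lists are hypothetical.
POSITIVE = {"great", "wonderful", "masterpiece", "enjoyable"}
NEGATIVE = {"boring", "awful", "dull", "disappointing"}

def keyword_sentiment(review: str) -> str:
    """Label a review by counting positive vs. negative keywords."""
    words = [w.strip(".,!?").lower() for w in review.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score >= 0 else "negative"

print(keyword_sentiment("A wonderful film, truly enjoyable."))  # positive
print(keyword_sentiment("Boring and disappointing."))           # negative
```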

DNN-based NLU

Feed downstream task data to a pre-trained language model to fine-tune it. The downstream task data should be as similar as possible to the deployment data.
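
A fine-tuning sketch using the Hugging Face transformers and datasets libraries; the model name, dataset, and hyperparameters are illustrative, not prescribed by these notes:

```python
# Sketch of fine-tuning a pre-trained LM on a downstream task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Downstream task data: IMDB reviews (close to a review-classification deployment)
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset["train"],
)
trainer.train()
```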

Tokenizer

A crucial component of NLP systems.

An NLP system needs a tokenizer to encode text into numbers so that a neural network can perform calculations on it.
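
For example, with a Hugging Face tokenizer (the exact ids depend on the pretrained vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("This is an example.")
print(encoded["input_ids"])       # one integer id per (sub)word, plus special tokens
print(encoded["attention_mask"])  # 1 marks a real token (0 would mark padding)
```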

Word Splitting

Method 1: Split on Whitespace

  • "convert" and "converts" are considered as different words
  • Punctuations make a different word
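
A quick illustration with Python's built-in str.split:

```python
text = "This is an example."
print(text.split())
# ['This', 'is', 'an', 'example.']  <- 'example.' (with period) would get
# a separate vocabulary entry from 'example'
```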

Method 2: Separate Words and Punctuation

"This is an example." -> ["This", "is", "an", "example", "."]. Drawback: the vocabulary becomes large, especially in multilingual tasks.
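
One possible sketch, using a regular expression to split off punctuation (tokenizers used in practice are more elaborate):

```python
import re

text = "This is an example."
# \w+ matches runs of word characters; [^\w\s] matches single punctuation marks
print(re.findall(r"\w+|[^\w\s]", text))
# ['This', 'is', 'an', 'example', '.']
```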

Method 3: Character/Byte-level Encoding

Example: CANINE. The vocabulary size is significantly reduced. "This is an example." -> ["T", "h", "i", "s", ...]. Drawback: the model needs to handle a much longer sequence.
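
A sketch of character- and byte-level encoding in plain Python (CANINE itself maps Unicode code points to ids; this only shows the general idea):

```python
text = "This is an example."
chars = list(text)                     # character-level tokens
byte_ids = list(text.encode("utf-8"))  # byte-level ids
print(chars[:5])                       # ['T', 'h', 'i', 's', ' ']
print(byte_ids[:5])                    # [84, 104, 105, 115, 32]
print(len(text.split()), len(chars))   # 4 words vs. 19 characters
```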

Method 4: Subword

  • Adopted by popular LMs (BERT, GPT)
  • Each pretrained model comes with its own tokenizer
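
For example, with a BERT tokenizer from Hugging Face (the exact subword pieces depend on the pretrained vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization handles rare words."))
# A rare word is split into known subwords, e.g. 'token' + '##ization';
# the exact pieces depend on the pretrained vocabulary.
```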

Two-step Encoding Process

Calling tokenizer(sentence) is equivalent to

  • tokens = tokenizer.tokenize(sentence)
  • input_ids = tokenizer.convert_tokens_to_ids(tokens)
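
Putting the two steps together (note that tokenizer(sentence) also adds special tokens such as [CLS]/[SEP] for BERT, so the comparison below disables them):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "This is an example."

tokens = tokenizer.tokenize(sentence)          # step 1: text -> subword tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # step 2: tokens -> vocabulary ids

# tokenizer(sentence) performs both steps (plus adding special tokens,
# which we disable here for an exact comparison)
assert ids == tokenizer(sentence, add_special_tokens=False)["input_ids"]
```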

Overview of the Pipeline

Text -> tokenization into input_ids and attention_mask -> pre-trained model -> linear classification head -> output.
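
A sketch of the full pipeline with Hugging Face transformers (the linear classification head is randomly initialized here, so the outputs are meaningless until fine-tuning):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Text -> input_ids + attention_mask
inputs = tokenizer("This movie was great!", return_tensors="pt")

# Pre-trained model + linear classification head -> output logits
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (meaningless before fine-tuning)
```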

BERT

BERT is the encoder part of the Transformer architecture.
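
For example, AutoModel loads the BERT encoder without any task head; it outputs one hidden vector per token:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder only, no task head

inputs = tokenizer("This is an example.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden): one vector per token
```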